Basic Graphics with ggplot2

Ethan Fosse
March 1, 2017

Research Associate, Department of Sociology

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Workshop Preliminaries

  1. Workshop Requirements
  2. Installing and loading ggplot2
  3. Our Research Question
  4. Roadmap for the Workshop

Two-Part Workshop

  • This is essentially a two-part workshop
  • In Part 1 (March 1) we will cover the basic commands of ggplot, which are bundled into the qplot() function (which stands for “quick plot”)
  • In Part 2 (March 8) we will cover more advanced functions in ggplot, which allow for greater flexibility when creating visualizations

1. Workshop Requirements

Before You Begin

  1. You have access to a laptop computer and Internet service
  2. You have downloaded and installed R with RStudio
  3. You have downloaded the R Workshop files:
    • States.RData
    • ggplot2part1.html

Go to: https://compass-workshops.github.io/info/

2. Installing and Loading ggplot2

The Power of R and R Packages

  • Since R is free and open-source, lots of people are writing R packages
  • An R package is just a collection of R functions, data, and code in a well-defined format
  • R packages can do everything from text mining (tm package) to data visualzation (ggplot2 package) to data wrangling (dplyr package)
  • Over 9,000 R packages as of 2016 on CRAN (Comprehensive R Archive Network)

Installing Packages in RStudio

  • To install R packages in RStudio: GUI versus R Console
  • 1. Using the GUI: Go to the Packages tab and click Install
  • 2. Using the R Console: install.packages("package_name")
    • package_name is just the name of the R package in quotes
    • Try this R Code: install.packages("ggplot2")

Loading an R Package For Use

  • Once you've installed an R package, it's then bundled with R and RStudio
  • You now have access to all of the functions, data, code, and other files associated with the installed R package
  • However, to access these files you must load your R package
    • Try this R Code: library(ggplot2)
  • You only have to install an R package once, but you must load it every time you start R

3. Our Research Question

The 2008 Presidential Election

Obama and McCain

The Electoral College Map

Obama and McCain Map

Examining the Election of 2008

  • Throughout this workshop we will consider the following question:
    • Why did some states go for Obama and others for McCain?
  • We'll explore this question using real data on 50 states
  • In the process of answering this question we'll learn about visualizing data with ggplot2

4. Roadmap for the Workshop

What We'll Learn Today

  • Part 1: Histograms and Density Plots
  • Part 2: Bar Plots and Faceting
  • Part 3: Boxplots
  • Part 4: Scatterplots
  • Part 5: Spatial Mapping

Part 1: Histograms and Density Plots

Mission #1: Educational Differences in the United States

  • How does the percent who went to high school differ across regions?
  • Later we'll examine if education might help us understand variation in voting across the United States

  • We'll use data to answer this question!

Loading Data into R

  • Let's load States.RData into our workspace
  • Using RStudio's user-friendly interface:

    1. File —> Open File
    2. Navigate to the location where you downloaded the files for this workshop (for example: C:/Folder/)
  • You can also try this R Code:

    load(C:/Folder/States.RData)
    

Viewing the Data as a Spreadsheet

  • Let's look at our data in more depth!
  • Let's use the function View()
  • Try this R Code:
View(States)
  • Ask yourself:
    • What are the observational units of the data set?
    • What variables look interesting or unusual to you?

Introducing the Grammar of Graphics

  • ggplot2 stands for Grammar of Graphics
  • The idea behind ggplot2 is that it breaks down plotting into a set of distinct inputs
  • There are two main functions in ggplot2: qplot() and ggplot()
  • We will focus more on qplot(), which stands for “quick plot”
  • qplot() is very powerful and has a similar syntax to the base R functions

Remember: Functions in R

  • The command qplot() is a function
  • We will use functions a lot!
  • A function takes one or more inputs (raw meat), does something to these inputs (grinds it up), and gives an output (ground meat)

Meat Grinder

The Basics of the qplot Function

  • The qplot() function has the following syntax:
qplot(x, y, data, color, shape, size, facets, geom, stat)
  • x and y: the variables to plot
  • data: your data set
  • color,shape, and size: aesthetic arguments
  • facets: optional splitting (or “faceting”) into subplots
  • geom: the actual visualization of the data (such as point or line)
  • stat: any statistical summaries to be applied

Creating a Histogram: State Populations

  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", bins=10)
  • Ask yourself:
    • How would you describe the distribution of income across states?
    • What are the inputs into this function?
    • What happens if you change the number of bins = 10 to bins=20 or bins=5?

Creating a Density Plot: State Populations

  • Try this R Code:
qplot(data=States, x=Population, geom="density")
  • Ask yourself:
    • How would you describe the distribution of income across states?
    • What are the pros/cons of a density plot over a histogram?

Changing the Color

  • We use the I() function to specify a particular color
  • The I() function tells ggplot2 that this is a color, not an R object in its own right
  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", color=I("red"), bins=20)

qplot(data=States, x=Population, geom="density", color=I("red"))
  • Ask yourself:
    • What is the “color” option doing to our graphs?
    • How do you change the color to “orange”?

Changing the Fill

  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", fill=I("red"), bins=20)

qplot(data=States, x=Population, geom="density", fill=I("red"))
  • Ask yourself:
    • What is the “fill” option doing to our graphs?
    • How do you change the fill to “orange”?

Population by McCain/Obama Outcome: Color

  • We can also specify the color in terms of another variable (typically categorical)
  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", color=ObamaMcCain, bins=20)

qplot(data=States, x=Population, geom="density", color=ObamaMcCain)
  • Ask yourself:
    • Were states higher in population more likely to go for Obama or McCain?

Population by McCain/Obama Outcome: Fill

  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", fill=ObamaMcCain, bins=20)

qplot(data=States, x=Population, geom="density", fill=ObamaMcCain)

  • Ask yourself:
    • What is a problem with the “fill” option, particularly for the density plot?

Adding Transparency to the Plots

  • Try this R Code:
qplot(data=States, x=Population, geom="histogram", fill=ObamaMcCain, bins=20, alpha=I(0.5))

qplot(data=States, x=Population, geom="density", fill=ObamaMcCain, alpha=I(0.5))

  • Ask yourself:
    • What happens if you change alpha to 0.2? How about 0.8?
    • What happens if you change alpha to 1? How about 0?

Challenge #1: Educational Differences in the United States

  • How does the percent who went to high school differ across regions?
  • Create a density plot with ggplot2 to answer this question!
  • R Code Hint:
qplot(data=States, x=HighSchool, geom="density", fill=Region, alpha=I(0.5))

Check-In #1: Density Plots and Histograms

  • At this point you should have:
    • Loaded a data set into R
    • Created histograms and density plots
    • Explored the distribution of a continuous variable by levels of a categorical variable using the “fill” and “color” options
    • Altered the transparency of graphs using the “alpha” option

Part 2: Bar Plots and Faceting

Mission #2: Rich State, Poor State, Red State, Blue State

  • What are the number of “rich” states that went for Obama compared to “poor”“ states?
  • To answer this question, we'll create bar plots

Creating a Categorical Variable

  • We will use the recode command from the car package
  • You need to install the car R package and then load it
  • Try this R Code:
install.packages("car")
library("car")

States$Rich <- car::recode(States$HouseholdIncome, "lo:55000='Poor'; 55001:hi = 'Rich'", as.factor.result=TRUE)

table(States$Rich)

  • Ask yourself:
    • What are the advantages and disadvantages of converting a continuous variable to a categorical variable?

Creating a Bar Plot with a Single Categorical Variable

  • Try this R Code:
qplot(data=States, x=Rich, geom="bar")
qplot(data=States, x=Rich, color=Rich, geom="bar")
qplot(data=States, x=Rich, fill=Rich, geom="bar")
  • Ask yourself:
    • What is the difference between these plots?

Creating a Bar Plot with Two Categorical Variables

  • Try this R Code:
qplot(data=States, x=Rich, color=Region, geom="bar")

qplot(data=States, x=Rich, fill=Region, geom="bar")

  • Ask yourself:
    • What is the difference between these plots?
    • Which graph do you prefer?

Introduction to Faceting

  • Faceting entails creating sub-plots using values of one or more other variables
  • These are sometimes called a trellis graphs, since there are many sub-plots that give the appearance of a trellis
  • The syntax is row_variable ~ column_variable
  • To create a trellis graph based on a single conditioning variable, use row_variable ~ . or . ~ column_variable
  • The . indicates that we're not conditioning on the rows or columns, respectively

Faceting with Bar Plots

  • Try this R Code:
qplot(data = States, x = Region, fill = Region, geom = "bar", facets = Rich ~ .)

qplot(data = States, x = Region, fill = Region, geom = "bar", facets = . ~ Rich)

qplot(data = States, x =Region, fill = Region, geom = "bar", facets = ObamaMcCain ~ Rich)
  • Ask yourself:
    • How many states are “rich”, in the Northeast, and went for McCain?
    • How many states are “rich”, in the Northeast, and went for Obama?

Improving Plots with Axis Labels

  • Try this R Code:
qplot(data=States, x=Rich, fill=Region, geom="bar", xlab="Income Category", ylab="Number of States")
  • We can add axis labels to any qplot, not just bar plots

  • Try this R Code:

qplot(data=States, x=Population, geom="histogram", bins=10, xlab="Population (in Millions)", ylab="Number of States")

Improving Plots with a Main Title

  • Try this R Code:
qplot(data=States, x=Rich, fill=Region, geom="bar", main="Bar Plot of Income Category by Region")
  • We can add axis labels to any qplot, not just bar plots

  • Try this R Code:

qplot(data=States, x=Population, geom="histogram", bins=10, main="Histogram of Population")

Challenge #2: Rich State, Poor State, Red State, Blue State

  • What are the number of “rich” states that went for Obama compared to “poor” states?
  • Create a bar plot with ggplot2 to answer this question

  • R Code Hint:

qplot(data = States, x = Rich, fill = Rich, geom = "bar", facets = . ~ ObamaMcCain)
  • Is this finding surprising to you?
  • Be careful with interpreting aggregated data!

Check-In #2: Bar Plots and Faceting

  • At this point you should have:
    • Created a bar plot for a single categorical variable
    • Created bar plots using multiple categorical variables
    • Learned how to improve the axis labels and main title of a graph

Part 3: Boxplots

Mission #3: A Gap Between Red and Blue States?

  • A lot of researchers and pundits claim that there are fundamental differences between “red” and “blue” states
  • We will look at a number of differences
  • Between those states that went for McCain versus Obama, is there a difference in the percentage who went to college?

Creating Box Plots

  • These are also known as box-and-whisker plots
  • They're great for examining the distribution of a continuous variable by levels of a categorical variable

  • Try this R Code:

qplot(data=States, x=ObamaMcCain, y=HouseholdIncome, geom="boxplot")

qplot(data=States, x=ObamaMcCain, y=GSP, geom="boxplot")

qplot(data=States, x=ObamaMcCain, y=GSP, fill=ObamaMcCain, geom="boxplot")

Adding Points to Boxplots

  • We can also overlay the box plots wiht points using the c() function

  • Try this R Code:

qplot(data=States, x=ObamaMcCain, y=HouseholdIncome, geom=c("boxplot","point"))

qplot(data=States, x=ObamaMcCain, y=HouseholdIncome, geom=c("boxplot","jitter"))

  • Ask yourself:
    • What is the difference between using point versus jitter?

Labeling Points

  • We can also label points using the label option and by specifying text as a geom object to be plotted
  • The size option controls the size of the text labels
  • Try this R Code:
qplot(data=States, x=ObamaMcCain, y=HouseholdIncome, geom="boxplot")

qplot(data=States, x=ObamaMcCain, label=State, y=GSP, geom=c("boxplot","text")) 

  • Ask yourself:
    • Which states are outliers?

Transforming a Variable

  • We can also transform a variable before plotting
  • Often skewed distributions are logged

  • Try this R Code:

qplot(data=States, x=ObamaMcCain, y=Population, geom="boxplot")

qplot(data=States, x=ObamaMcCain, y=Population, log="y", geom="boxplot") 

Challenge #3: A Gap Between Red and Blue States?

  • Between those states that went for McCain versus Obama, is there a difference in the percentage that went to college?

  • R Code Hint:

qplot(data=States, x=ObamaMcCain, y=College, geom=c("boxplot","point"))

qplot(data=States, x=ObamaMcCain, y=College, label=State, geom=c("boxplot","text"))
  • Which states have the lowest and highest levels of education for each level of the variable ObamaMcCain?

Check-In #3: Box Plots

  • At this point you should have:
    • Generated box plots for a continuous variable across levels of a categorical variable
    • Placed points over a plot
    • Placed labels over a plot

Part 4: Scatter Plots

Mission #4: Explaining State-Level Voting

  • Our data set has several numerical variables, including HousedholdIncome, College, and NonWhite
  • Which of these variables do you think is most closely related to the percentage voting for Obama?

Scatter Plot of Two Numerical Variables

  • We can also transform a variable before plotting
  • Often skewed distributions are logged

  • Try this R Code:

qplot(data=States, x=College, y=HouseholdIncome, geom="point")
cor(States$College, States$HouseholdIncome)

qplot(data=States, x=College, y=NonWhite, geom="point")
cor(States$College, States$NonWhite)
  • Ask yourself:
    • In what way does a scatter plot reveal more information than a correlation?

Adding a Line

  • We can add a smoothed line over the graph, which includes standard errors to represent sampling uncertainty

  • Try this R Code:

qplot(data=States, x=College, y=HouseholdIncome, geom=c("point","smooth"))

qplot(data=States, x=College, y=NonWhite, geom=c("point","smooth"))
  • Ask yourself:
    • Are the relationships between these variables roughly linear?

Labeling Points

  • We can also replace the points with text labels

  • Try this R Code:

qplot(data=States, x=College, y=HouseholdIncome, label=State, geom=c("smooth", "text"))

qplot(data=States, x=College, y=NonWhite, label=State, geom=c("smooth", "text"))
  • Ask yourself:
    • Which states are outliers?
    • Are there any unusual data points?

Challenge #4: Explaining State-Level Voting

  • Our data set has several numerical variables, including HousedholdIncome, College, and NonWhite
  • Which of these variables do you think is most predictive of the percentage voting for Obama?

  • R Code Hint:

qplot(data=States, x=HouseholdIncome, y=ObamaVote, label=State, geom=c("text","smooth"))

qplot(data=States, x=College, y=ObamaVote, label=State, geom=c("text","smooth"))

qplot(data=States, x=NonWhite, y=ObamaVote, label=State, geom=c("text","smooth"))

Check-In #4: Scatter Plots

  • At this point you should have:
    • Created a scatter plot between two numerical variables
    • Generated a smoothing line over a scatter plot
    • Overlaid a scatter plot with text labels

5. Spatial Mapping

Mission #5: Which States are Red, Which States are Blue?

  • We will answer the question: which states are “red” and which states are “blue”?
  • We can answer this question by looking at the raw data, but a visualization will be much easier to interpret
  • Your goal is to create a map of the United States with information on the voting patterns

Grabbing Info on Latitude and Longitude

  • The ggplot2 function map_data() on latitude and longitude of U.S. states

  • Try this R Code:

USAMap <- map_data("state")
str(USAMap)
View(USAMap)
  • Ask yourself:
    • What kind of information is in this data set?
    • How might this data set help us create a map?

Merging the Two Data Sets

  • Our goal is to merge these two data sets, which requires a common variable

  • Try this R Code:

States$region <- tolower(States$State)

StatesMerged <- merge(x=USAMap, y=States, by = "region")

str(StatesMerged)
  • Ask yourself:
    • How many rows and columns are in StatesMerged?

Preparing for Mapmaking

  • The new data set StatesMerged also has information on latitude, longitude, and a variable called order
  • This encodes information on how to draw the state boundaries
  • We want the latitude and longitude values to be correctly sorted, so we will use the order() function
  • This just sorts the rows into ascending order so that the map will be drawn correctly

Preparing for Mapmaking (cont.)

  • Try this R Code:
StatesMerged <- StatesMerged[order(StatesMerged$order), ]
  • Ask yourself:
    • Are we sorting the rows or columns?
    • What is the name of the data set and the variable that is being sorted?

Map of Household Income

  • Now we're ready to create a map!

  • Try this R Code:

qplot(data = StatesMerged, x=long, y=lat, group = group, fill = HouseholdIncome, geom = "polygon")

  • Ask yourself:
    • Which states have the highest income? How about the lowest?

More Maps with Differing Colors

  • Let's create some more maps!
  • We'll use scale_fill_gradient() to alter the color gradient

  • Try this R Code:

qplot(data = StatesMerged, x=long, y=lat, group = group, fill = College, geom = "polygon") + scale_fill_gradient(low="red", high="green")
  • Ask yourself:
    • Which states have the highest income? How about the lowest?

Challenge #5: Which States are Red, Which States are Blue?

  • Which states are “red” and which states are “blue”?
  • If you've followed along, then you should already have the data set ready!

  • R Code Hint:

qplot(data = StatesMerged, x=long, y=lat, group = group, fill = ObamaVote, geom = "polygon") + scale_fill_gradient(low="red", high="blue")
  • Bonus:
    • Wait, why are so many states purple or pink-ish?

Check-In #5: Spatial Mapping

  • At this point you should have:
    • Loaded a data set with spatial information
    • Merged this data set with another data set
    • Created a map of U.S. states
    • Altered the color scheme of the map

Recap of the Workshop

  • At this point you should have:
    • Created density plots and histograms
    • Generated bar plots and box plots
    • Created scatter plots with smoothing lines and labels
    • Mapped social science data

Feedback Survey:

For More Information:

URL: https://compass-workshops.github.io/info/

Email List: Send an email to listserv@lists.princeton.edu with “Subscribe COMPASSWORKSHOPS” in the body and all other lines blank, including the subject

  • Free, open-source statistical programming and data analysis workshops using R and RStudio
  • Open to everyone with a Princeton ID
  • No programming experience is necessary or expected
  • Attendees should bring a laptop computer to fully participate in the workshops

Our Website

Our Mailing List

Send an email to listserv@lists.princeton.edu with “Subscribe COMPASSWORKSHOPS” in the body and all other lines blank, *including the subject*.

People

  • Teaching Staff

    • Ethan Fosse (Research Associate, Department of Sociology)
    • Yunkyu Sohn (Research Associate, Department of Politics)
  • Faculty Sponsors